Project 4, Explore and Summarize Data by David He

Prosper Marketplace, Inc. is a San Francisco, California-based company that does peer-to-peer lending. It’s the first peer-to-peer lending marketplace in the industry, with over $7 billion in funded loans. Borrowers request personal loans on Prosper, and investors choose which loans to fund based on the borrower’s credit scores, ratings, and histories, and on the category of the loan. Prosper handles the servicing of the loan, and collects and distributes payments and interest to the investors.

For this project, Udacity provided a sample of the loan data from Prosper (last updated on 03/11/2014). The data can be downloaded here and the variable dictionary is here.

With the data loaded, I’ll do a quick check on the types of variables in this large dataset.

## 'data.frame':    113937 obs. of  81 variables:
##  $ ListingKey                         : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
##  $ ListingNumber                      : int  193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
##  $ ListingCreationDate                : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ CreditGrade                        : Factor w/ 8 levels "A","AA","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
##  $ Term                               : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus                         : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ ClosedDate                         : Factor w/ 2802 levels "2005-11-25 00:00:00",..: 1137 NA 1262 NA NA NA NA NA NA NA ...
##  $ BorrowerAPR                        : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate                       : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                        : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ EstimatedEffectiveYield            : num  NA 0.0796 NA 0.0849 0.1832 ...
##  $ EstimatedLoss                      : num  NA 0.0249 NA 0.0249 0.0925 ...
##  $ EstimatedReturn                    : num  NA 0.0547 NA 0.06 0.0907 ...
##  $ ProsperRating..numeric.            : int  NA 6 NA 6 3 5 2 4 7 7 ...
##  $ ProsperRating..Alpha.              : Factor w/ 7 levels "A","AA","B","C",..: NA 1 NA 1 5 3 6 4 2 2 ...
##  $ ProsperScore                       : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ ListingCategory..numeric.          : int  0 2 0 16 2 1 1 2 7 7 ...
##  $ BorrowerState                      : Factor w/ 51 levels "AK","AL","AR",..: 6 6 11 11 24 33 17 5 15 15 ...
##  $ Occupation                         : Factor w/ 67 levels "Accountant/CPA",..: 36 42 36 51 20 42 49 28 23 23 ...
##  $ EmploymentStatus                   : Factor w/ 8 levels "Employed","Full-time",..: 8 1 3 1 1 1 1 1 1 1 ...
##  $ EmploymentStatusDuration           : int  2 44 NA 113 44 82 172 103 269 269 ...
##  $ IsBorrowerHomeowner                : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CurrentlyInGroup                   : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ GroupKey                           : Factor w/ 706 levels "00343376901312423168731",..: NA NA 334 NA NA NA NA NA NA NA ...
##  $ DateCreditPulled                   : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
##  $ CreditScoreRangeLower              : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper              : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ FirstRecordedCreditLine            : Factor w/ 11585 levels "1947-08-24 00:00:00",..: 8638 6616 8926 2246 9497 496 8264 7684 5542 5542 ...
##  $ CurrentCreditLines                 : int  5 14 NA 5 19 21 10 6 17 17 ...
##  $ OpenCreditLines                    : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ TotalCreditLinespast7years         : int  12 29 3 29 49 49 20 10 32 32 ...
##  $ OpenRevolvingAccounts              : int  1 13 0 7 6 13 6 5 12 12 ...
##  $ OpenRevolvingMonthlyPayment        : num  24 389 0 115 220 1410 214 101 219 219 ...
##  $ InquiriesLast6Months               : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ TotalInquiries                     : num  3 5 1 1 9 2 0 16 6 6 ...
##  $ CurrentDelinquencies               : int  2 0 1 4 0 0 0 0 0 0 ...
##  $ AmountDelinquent                   : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years            : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years           : int  0 1 0 0 0 0 0 1 0 0 ...
##  $ PublicRecordsLast12Months          : int  0 0 NA 0 0 0 0 0 0 0 ...
##  $ RevolvingCreditBalance             : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization                : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ AvailableBankcardCredit            : num  1500 10266 NA 30754 695 ...
##  $ TotalTrades                        : num  11 29 NA 26 39 47 16 10 29 29 ...
##  $ TradesNeverDelinquent..percentage. : num  0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
##  $ TradesOpenedLast6Months            : num  0 2 NA 0 2 0 0 0 1 1 ...
##  $ DebtToIncomeRatio                  : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ IncomeRange                        : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ IncomeVerifiable                   : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome                : num  3083 6125 2083 2875 9583 ...
##  $ LoanKey                            : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
##  $ TotalProsperLoans                  : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ TotalProsperPaymentsBilled         : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ OnTimeProsperPayments              : int  NA NA NA NA 11 NA NA NA NA NA ...
##  $ ProsperPaymentsLessThanOneMonthLate: int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPaymentsOneMonthPlusLate    : int  NA NA NA NA 0 NA NA NA NA NA ...
##  $ ProsperPrincipalBorrowed           : num  NA NA NA NA 11000 NA NA NA NA NA ...
##  $ ProsperPrincipalOutstanding        : num  NA NA NA NA 9948 ...
##  $ ScorexChangeAtTimeOfListing        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanCurrentDaysDelinquent          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LoanFirstDefaultedCycleNumber      : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LoanMonthsSinceOrigination         : int  78 0 86 16 6 3 11 10 3 3 ...
##  $ LoanNumber                         : int  19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
##  $ LoanOriginalAmount                 : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ LoanOriginationDate                : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
##  $ LoanOriginationQuarter             : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
##  $ MemberKey                          : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
##  $ MonthlyLoanPayment                 : num  330 319 123 321 564 ...
##  $ LP_CustomerPayments                : num  11396 0 4187 5143 2820 ...
##  $ LP_CustomerPrincipalPayments       : num  9425 0 3001 4091 1563 ...
##  $ LP_InterestandFees                 : num  1971 0 1186 1052 1257 ...
##  $ LP_ServiceFees                     : num  -133.2 0 -24.2 -108 -60.3 ...
##  $ LP_CollectionFees                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                          : int  258 1 41 158 20 1 1 1 1 1 ...

Since ProsperScore and ProsperRating..Alpha. can only take a limited number of distinct values, I’m converting them into ordered factor variables and reordering their levels so they display better.

##  Ord.factor w/ 11 levels "1"<"2"<"3"<"4"<..: NA 7 NA 9 4 10 2 4 9 11 ...
##  Ord.factor w/ 7 levels "HR"<"E"<"D"<"C"<..: NA 6 NA 6 3 5 2 4 7 7 ...
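A minimal sketch of how such a conversion might look in base R. The toy data frame holds only the first ten values shown in the `str()` output above, and the new column name `ProsperRating` is an assumption for illustration.

```r
# Toy stand-in for the two columns (first ten rows of the str() output).
prosper <- data.frame(
  ProsperScore = c(NA, 7, NA, 9, 4, 10, 2, 4, 9, 11),
  ProsperRating..Alpha. = c(NA, "A", NA, "A", "D", "B", "E", "C", "AA", "AA"),
  stringsAsFactors = FALSE
)

# Scores run from 1 to 11; declare all eleven levels even if some are absent.
prosper$ProsperScore <- factor(prosper$ProsperScore,
                               levels = 1:11, ordered = TRUE)

# Ratings ordered from HR up to AA, matching the order in the output above.
# The column name ProsperRating is an assumption.
rating_levels <- c("HR", "E", "D", "C", "B", "A", "AA")
prosper$ProsperRating <- factor(prosper$ProsperRating..Alpha.,
                                levels = rating_levels, ordered = TRUE)

str(prosper$ProsperScore)
str(prosper$ProsperRating)
```

Passing the levels explicitly is what controls the display order; without it, R would sort the ratings alphabetically ("A", "AA", "B", ...), as in the original `str()` output.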

This dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

Univariate Plots Section

##    HR     E     D     C     B     A    AA 
##  6935  9795 14274 18345 15581 14551  5372

It seems the loan ratings follow a roughly bell-shaped distribution, with “C” ratings being the most frequent.

##     1     2     3     4     5     6     7     8     9    10    11 
##   992  5766  7642 12595  9813 12278 10597 12053  6911  4750  1456

It seems the distribution of Prosper Scores is similar to the distribution of Prosper Ratings. The scores are most concentrated between 4 and 8.

The majority of the borrowers are within the $25k to $75k range. The surprising thing is that within this dataset, people within the range of $1-24,999 borrowed noticeably less often than people in the higher income ranges. One would think these people need the most financial help. Perhaps Prosper did not reach this segment of people.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

It seems most people took loan amounts of under $10,000. One interesting observation is that the number of loans spikes at $5,000 intervals, as seen at $10,000, $15,000, $20,000, etc. Is it possible that people lean towards amounts in multiples of $5,000? Or perhaps Prosper offers a selection list of amounts that are multiples of $5,000, and lets customers specify an amount if their desired amount is not in the list?
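One way to put a number on the spikes is to measure the share of loans whose amount is an exact multiple of $5,000. The sketch below uses only the first ten LoanOriginalAmount values from the `str()` output, so the resulting share is illustrative and not the figure for the full dataset.

```r
# First ten loan amounts from the str() output; in the full analysis this
# would be the whole LoanOriginalAmount column.
loan_amount <- c(9425, 10000, 3001, 10000, 15000,
                 15000, 3000, 10000, 10000, 10000)

# TRUE where the amount is an exact multiple of $5,000.
is_round <- loan_amount %% 5000 == 0

# Share of loans at round amounts, and the most frequent amounts.
share_round <- mean(is_round)
top_amounts <- sort(table(loan_amount), decreasing = TRUE)

share_round
head(top_amounts, 3)
```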

##    AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA    HI 
##   200  1679   855  1901 14717  2210  1627   382   300  6720  5008   409 
##    IA    ID    IL    IN    KS    KY    LA    MA    MD    ME    MI    MN 
##   186   599  5921  2078  1062   983   954  2242  2821   101  3593  2318 
##    MO    MS    MT    NC    ND    NE    NH    NJ    NM    NV    NY    OH 
##  2615   787   330  3084    52   674   551  3097   472  1090  6729  4197 
##    OK    OR    PA    RI    SC    SD    TN    TX    UT    VA    VT    WA 
##   971  1817  2972   435  1122   189  1737  6842   877  3278   207  3048 
##    WI    WV    WY  NA's 
##  1842   391   150  5515

Nothing too interesting here - states with large cities have more people, and therefore account for more loans than states with smaller cities.

## 
##     0     1     2     3     4     5     6     7     8     9    10    11 
## 16965 58308  7433  7189  2395   756  2572 10494   199    85    91   217 
##    12    13    14    15    16    17    18    19    20 
##    59  1996   876  1522   304    52   885   768   771

## 
##      Not Available Debt Consolidation   Home Improvement 
##              16965              58308               7433 
##           Business      Personal Loan        Student Use 
##               7189               2395                756 
##               Auto              Other      Baby&Adoption 
##               2572              10494                199 
##               Boat Cosmetic Procedure    Engagement Ring 
##                 85                 91                217 
##        Green Loans Household Expenses    Large Purchases 
##                 59               1996                876 
##     Medical/Dental         Motorcycle                 RV 
##               1522                304                 52 
##              Taxes           Vacation      Wedding Loans 
##                885                768                771

One category stands far above the rest, and that is Debt Consolidation. This makes sense, since a lot of people in debt may be carrying high-interest loans from elsewhere. Getting a better rate from Prosper could save them a massive amount of interest on those loans.
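The second table above is the first one with the numeric codes swapped for readable names. A sketch of that mapping, assuming the codes 0 through 20 correspond to the labels in the order they appear in the output (the vector names are illustrative):

```r
# Labels in code order 0..20, taken from the table above.
category_labels <- c(
  "Not Available", "Debt Consolidation", "Home Improvement", "Business",
  "Personal Loan", "Student Use", "Auto", "Other", "Baby&Adoption", "Boat",
  "Cosmetic Procedure", "Engagement Ring", "Green Loans",
  "Household Expenses", "Large Purchases", "Medical/Dental", "Motorcycle",
  "RV", "Taxes", "Vacation", "Wedding Loans"
)

# First ten ListingCategory..numeric. values from the str() output.
listing_code <- c(0, 2, 0, 16, 2, 1, 1, 2, 7, 7)

# Map each code to its label; levels and labels are matched by position.
listing_category <- factor(listing_code, levels = 0:20,
                           labels = category_labels)

# Count only the categories that actually occur in this small sample.
table(droplevels(listing_category))
```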

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00   44.00   80.48  115.00 1189.00

It seems a sizeable number of loans are funded entirely by a single investor, and the majority of loans are funded by fewer than 100 investors. This is not too surprising, since loan amounts (as seen before) are usually less than $10,000, which a single investor can comfortably cover.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.0100  0.1242  0.1730  0.1827  0.2400  0.4925

Based on the plot, most yields are between 5% and 35%, with the highest concentration near 17% (the median is 17.3%).

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

It seems the majority of all loans in the dataset are in good standing. If anything, this reflects nicely on Prosper as a platform for peer-to-peer lending, since many people, including me, had doubts about the safety of investing in non-traditional loans.
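The share can be computed directly from the counts in the table above. Which statuses count as "good standing" is an assumption here: the sketch treats Completed, Current, and FinalPaymentInProgress as good standing.

```r
# Loan status counts copied from the table above.
status_counts <- c(
  "Cancelled" = 5, "Chargedoff" = 11992, "Completed" = 38074,
  "Current" = 56576, "Defaulted" = 5018, "FinalPaymentInProgress" = 205,
  "Past Due (>120 days)" = 16, "Past Due (1-15 days)" = 806,
  "Past Due (16-30 days)" = 265, "Past Due (31-60 days)" = 363,
  "Past Due (61-90 days)" = 313, "Past Due (91-120 days)" = 304
)

# Assumption: these three statuses are what "good standing" means.
good <- c("Completed", "Current", "FinalPaymentInProgress")

good_share <- sum(status_counts[good]) / sum(status_counts)
round(good_share, 3)
```

Under that definition, roughly 83% of the loans are in good standing.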

Univariate Analysis

What is the structure of your dataset?

The Prosper Loans dataset has 113937 observations and 83 variables (the 81 original variables plus the two I created). The variables fall into 3 classes - numeric, factor, and int. The variables I explored in the Univariate Plots section were:

* ProsperRating: Factor variable with 7 levels
* ProsperScore: Factor variable with 11 levels
* IncomeRange: Factor variable with 8 levels
* LoanOriginalAmount: Integer variable
* BorrowerState: Factor variable with 51 levels
* ListingCategory: Factor variable with 21 levels
* Investors: Integer variable
* LenderYield: Numeric variable
* Status: Factor variable with 4 levels

What is/are the main feature(s) of interest in your dataset?

I want to determine whether the status of a loan is connected to or affected by certain variables, such as the yield or the borrower’s income and ratings. I also want to check out how the Lender Yield is affected by things like credit scores and Prosper Scores.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Credit scores and loan terms can be considered as well. Since a credit score reflects how consistently borrowers pay back what they borrowed, it would make sense for investors to favor loans to borrowers with good credit scores, since borrowers with bad credit scores are more likely to default or delay payment. On the other hand, loan terms may be important to certain people, depending on whether they are looking for something long term or short term.

Did you create any new variables from existing variables in the dataset?

Yes, I introduced the ProsperRating and Status variables, and set them to be ordered factor variables.
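A sketch of how a collapsed Status variable could be derived from LoanStatus. The grouping and the level names below are guesses based on the labels that appear in the regression table later in this report ("Defaulted", "Past Due", "Current or Paid"); the actual grouping may differ, since the summary above reports 4 levels.

```r
# A few example LoanStatus values (not real rows from the dataset).
loan_status <- c("Completed", "Current", "Defaulted", "Chargedoff",
                 "Past Due (1-15 days)", "FinalPaymentInProgress",
                 "Cancelled")

# Start with NA so any status not covered below stays missing.
status <- rep(NA_character_, length(loan_status))

# Hypothetical grouping: which raw statuses fall in which bucket is assumed.
status[loan_status %in% c("Defaulted", "Chargedoff")] <- "Defaulted"
status[grepl("^Past Due", loan_status)] <- "Past Due"
status[loan_status %in% c("Completed", "Current",
                          "FinalPaymentInProgress")] <- "Current or Paid"

# Order the levels from worst to best outcome.
status <- factor(status,
                 levels = c("Defaulted", "Past Due", "Current or Paid"),
                 ordered = TRUE)

table(status, useNA = "ifany")
```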

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There were some unusual distributions, such as Number of Loans by Investors and Number of Loans by Loan Amount. Both are skewed to the right.

I did factorize and order a few variables. I did this because statistical models treat numeric and factor variables differently, and likewise unordered and ordered factor variables. To make sure the models use the correct method, I had to make sure the variables I’m investigating are of the right data type.

Bivariate Plots Section

##                    LoanOriginalAmount  Investors LenderYield
## LoanOriginalAmount          1.0000000  0.3800926  -0.3284551
## Investors                   0.3800926  1.0000000  -0.2741739
## LenderYield                -0.3284551 -0.2741739   1.0000000

A quick correlation matrix shows that the continuous variables LoanOriginalAmount, Investors, LenderYield, CreditScoreRangeAvg, and Term do not have a strong correlation with one another; among the three shown here, the largest coefficient in absolute value is 0.38 (LoanOriginalAmount vs Investors). I do want to check out the relationship (if any) of Score and Rating with the Investors variable.
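A correlation matrix like the one above can be produced with base R's `cor()`. The sketch uses the first six rows from the `str()` output; the sixth LenderYield value is not visible there, so it is left as NA, which also shows how `use = "complete.obs"` drops incomplete rows. With so few rows the coefficients will not match the full-data matrix.

```r
# First six rows of three continuous variables from the str() output.
toy <- data.frame(
  LoanOriginalAmount = c(9425, 10000, 3001, 10000, 15000, 15000),
  Investors          = c(258, 1, 41, 158, 20, 1),
  LenderYield        = c(0.138, 0.082, 0.24, 0.0874, 0.1985, NA)
)

# Pearson correlations, using only rows with no missing values.
cor_matrix <- cor(toy, use = "complete.obs")
round(cor_matrix, 3)

# A single pairwise coefficient can be read off the matrix by name.
cor_matrix["Investors", "LenderYield"]
```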

Based on this scatterplot, it seems rating and score have a positive linear relationship; higher ratings tend to go with higher scores. I suspect that Investors relates to Rating and to Score in a similar way. Let’s find out.

## prosper$ProsperScore: 1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   16.00   35.00   40.78   58.00  293.00 
## -------------------------------------------------------- 
## prosper$ProsperScore: 2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0    10.0    30.3    45.0   470.0 
## -------------------------------------------------------- 
## prosper$ProsperScore: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    9.50   35.24   52.00  483.00 
## -------------------------------------------------------- 
## prosper$ProsperScore: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    6.00   36.88   53.00  833.00 
## -------------------------------------------------------- 
## prosper$ProsperScore: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00   26.00   51.72   74.00 1024.00 
## -------------------------------------------------------- 
## prosper$ProsperScore: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00   30.00   59.62   88.00  821.00 
## -------------------------------------------------------- 
## prosper$ProsperScore: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00   33.00   67.78  104.00 1035.00 
## -------------------------------------------------------- 
## prosper$ProsperScore: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     8.0    70.0   105.7   162.0  1189.0 
## -------------------------------------------------------- 
## prosper$ProsperScore: 9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    23.0    92.0   121.5   183.0   659.0 
## -------------------------------------------------------- 
## prosper$ProsperScore: 10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0    88.0   133.6   206.0   779.0 
## -------------------------------------------------------- 
## prosper$ProsperScore: 11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0     1.0    94.3   169.0   714.0
## # A tibble: 12 x 2
##    ProsperScore     n
##           <ord> <int>
##  1            1   992
##  2            2  5766
##  3            3  7642
##  4            4 12595
##  5            5  9813
##  6            6 12278
##  7            7 10597
##  8            8 12053
##  9            9  6911
## 10           10  4750
## 11           11  1456
## 12           NA 29084

It seems that as we go higher on the Prosper Score scale of 1 through 11, the median number of investors for a given loan generally goes up, from 6 at a score of 4 to around 90 at scores of 9 and 10. The interesting thing is that at a score of 11, the median is 1. One explanation may be that since the score is so high and the loan yield is almost guaranteed, individual investors buy out the entirety of the loan amount.
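Per-group summaries like the ones above can be obtained with `by()`, and the medians alone with `tapply()`. The investor counts in this sketch are made up for illustration; only the technique mirrors the output above.

```r
# Made-up investor counts for three score groups (illustrative only).
toy <- data.frame(
  ProsperScore = factor(c(4, 4, 4, 9, 9, 9, 11, 11, 11),
                        levels = 1:11, ordered = TRUE),
  Investors    = c(1, 6, 53, 23, 92, 183, 1, 1, 169)
)

# Drop unused score levels so empty groups are not reported.
score <- droplevels(toy$ProsperScore)

# Full five-number summary plus mean for each score.
by(toy$Investors, score, summary)

# Just the median number of investors per score.
median_by_score <- tapply(toy$Investors, score, median)
median_by_score

# Loans per score, counting missing scores too.
table(toy$ProsperScore, useNA = "ifany")
```

Comparing medians instead of means matters here because the investor counts are strongly right-skewed, so a few heavily crowdfunded loans would otherwise dominate each group.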

This plot supports my earlier guess that a better score/rating usually results in a higher number of investors. Now let’s examine a few other relationships.

This graph shows a peculiar relationship: it seems that as the income level goes up, more money is borrowed per loan. One would think having a higher income would reduce the need to take out a loan for high-value assets like houses, cars, etc. But on the contrary, the relationship between income and loan amount is positive, not inverse. Perhaps people of different income levels have different priorities - people with high income may categorize things like starting a business or other expensive ventures as high priority. People with low income would probably categorize paying off debt as high priority. It’s also possible that people with higher income tend to buy the same things as others, but of higher quality, hence a slight increase in median loan amount. This prompts a Loan Category vs Income Range exploration.

Judging by this chart, debt consolidation is a huge piece of the puzzle for people from all socio-economic classes. Without further debt data to break down, it’s hard for me to tell why people with high income take out larger loans than people with lower income. Is it their spending habits? Or perhaps in certain areas (such as the Bay Area), a $100k salary is considered only average? Moving on.

## [1] -0.2741739

As previously observed in the correlation matrix, higher lender yield is associated with a slight decrease in the number of investors (r = -0.27). Perhaps higher yield usually means higher risk, and more people are risk-averse than not.

Having high income doesn’t necessarily mean the borrower will not default. Having a high Prosper Rating and Score does seem to affect the number of defaults - the occurrences of Past Dues and Defaults seem to become very rare at very high scores/ratings (when score = 10 or 11, rating = A or AA). This, plus the original plot of “Number of Investors vs Prosper Rating”, makes a lot of sense, since investors would not appreciate their investments going down the drain.

We now see a much clearer picture. Higher scores usually result in lower yield, since there is lower risk involved.

Looking at the boxplots, the picture becomes much clearer. I will discuss these relationships further in the Bivariate Analysis section below.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Based on the last bivariate plot, when comparing loans in good standing to loans that are not, I found that:

1) Loans in good standing have a higher median loan amount.
2) Loans in good standing have a lower median number of investors.
3) Loans in good standing have a lower median yield.
4) The people requesting loans that are in good standing have a higher median credit score.
5) There is no major difference in loan duration between loans in good standing and the delinquent ones.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Some other interesting relationships I found are:

* Median Yield goes down as Prosper Score goes up
* Number of Investors goes down as Yield goes up
* Out of all the listing categories, a huge chunk of loans belong to the “Debt Consolidation” category, and also people who are relatively well-off (salary of $75k or above) have most of their loans in this category.
* Loan amount increases as income level increases
* Median number of Investors goes up as Prosper score or rating goes up

What was the strongest relationship you found?

The strongest relationship is definitely Median Lender Yield vs Prosper Score. It confirms our common knowledge that the higher the Prosper Score, the lower the yield. “High risk, high reward”, as people always say.

Multivariate Plots Section

What I discussed before in the bivariate analysis section can still be seen here: as Prosper Score increases, yield decreases for loans of all statuses. However, loans in good standing have slightly lower yield than the others across the board. This is probably because there are other variables affecting yield, such as the current delinquent amount or credit scores.

Similar to the previous plot, yield decreases as credit score increases, for all loan statuses. Once again, loans that are in good-standing have lower yield than loans that are delinquent.

This graph brings out some new insight on credit scores and how they are a relatively good indicator of whether a loan is going to go bad. The median credit score of borrowers with loans that are either defaulted or past due does not surpass 690, whereas the median credit score of borrowers with loans in good standing surpasses 690, and by a wider margin as income level increases.
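The medians behind a plot like this can be tabulated with `aggregate()`. Two assumptions in this sketch: CreditScoreRangeAvg is taken to be the midpoint of the lower and upper credit score bounds (the report does not show how it was computed), and the Status values assigned to the rows are made up. The credit score bounds are the first six rows from the `str()` output.

```r
toy <- data.frame(
  CreditScoreRangeLower = c(640, 680, 480, 800, 680, 740),
  CreditScoreRangeUpper = c(659, 699, 499, 819, 699, 759),
  # Made-up statuses, for illustration only.
  Status = c("Defaulted", "Current or Paid", "Defaulted",
             "Current or Paid", "Past Due", "Current or Paid")
)

# Assumption: the average score is the midpoint of the reported range.
toy$CreditScoreRangeAvg <-
  (toy$CreditScoreRangeLower + toy$CreditScoreRangeUpper) / 2

# Median credit score per loan status.
median_score <- aggregate(CreditScoreRangeAvg ~ Status,
                          data = toy, FUN = median)
median_score
```

Adding a second grouping variable to the formula (for example `~ Status + IncomeRange`) would give one median per status and income range combination.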

## 
## Calls:
## m1: lm(formula = LenderYield ~ ProsperScore, data = prosper)
## m2: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount, 
##     data = prosper)
## m3: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount + 
##     CreditScoreRangeAvg, data = prosper)
## m4: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount + 
##     CreditScoreRangeAvg + AmountDelinquent, data = prosper)
## m5: lm(formula = LenderYield ~ ProsperScore + LoanOriginalAmount + 
##     CreditScoreRangeAvg + AmountDelinquent + Status, data = prosper)
## 
## ======================================================================================================
##                                          m1           m2           m3           m4           m5       
## ------------------------------------------------------------------------------------------------------
##   (Intercept)                          0.184***     0.213***     0.496***     0.495***     0.531***   
##                                       (0.000)      (0.000)      (0.003)      (0.003)      (0.003)     
##   ProsperScore: .L                    -0.219***    -0.192***    -0.165***    -0.165***    -0.159***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: .Q                    -0.010***    -0.009***    -0.006***    -0.006***    -0.008***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: .C                    -0.003**      0.003**      0.002*       0.002*       0.007***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: ^4                     0.026***     0.030***     0.021***     0.021***     0.016***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: ^5                     0.005***     0.006***     0.003***     0.003***     0.005***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: ^6                    -0.006***    -0.004***    -0.008***    -0.008***    -0.009***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: ^7                     0.004***     0.002**      0.002***     0.002***     0.002**    
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: ^8                     0.005***     0.003***     0.003***     0.003***     0.002***   
##                                       (0.001)      (0.001)      (0.001)      (0.001)      (0.001)     
##   ProsperScore: ^9                    -0.005***    -0.004***    -0.006***    -0.006***    -0.006***   
##                                       (0.001)      (0.001)      (0.000)      (0.000)      (0.000)     
##   ProsperScore: ^10                    0.008***     0.007***     0.007***     0.007***     0.007***   
##                                       (0.001)      (0.001)      (0.000)      (0.000)      (0.000)     
##   LoanOriginalAmount                               -0.000***    -0.000***    -0.000***    -0.000***   
##                                                    (0.000)      (0.000)      (0.000)      (0.000)     
##   CreditScoreRangeAvg                                           -0.000***    -0.000***    -0.000***   
##                                                                 (0.000)      (0.000)      (0.000)     
##   AmountDelinquent                                                            0.000***     0.000***   
##                                                                              (0.000)      (0.000)     
##   Status: Past Due/Defaulted                                                              -0.025***   
##                                                                                           (0.001)     
##   Status: Current or Paid/Defaulted                                                       -0.047***   
##                                                                                           (0.001)     
## ------------------------------------------------------------------------------------------------------
##   R-squared                                0.440        0.504        0.556        0.557        0.585  
##   adj. R-squared                           0.440        0.504        0.556        0.557        0.585  
##   sigma                                    0.056        0.053        0.050        0.050        0.048  
##   F                                     6664.472     7823.693     8869.210     8191.290     7959.190  
##   p                                        0.000        0.000        0.000        0.000        0.000  
##   Log-likelihood                      124404.571   129521.286   134299.039   134311.792   137080.870  
##   Deviance                               264.690      234.618      209.631      209.568      196.326  
##   AIC                                -248785.142  -259016.573  -268570.077  -268593.584  -274127.741  
##   BIC                                -248672.958  -258895.040  -268439.196  -268453.354  -273968.813  
##   N                                    84853        84853        84853        84853        84853      
## ======================================================================================================

The linear model I built is to predict the yield of a loan, with the formula (LenderYield ~ ProsperScore + LoanOriginalAmount + CreditScoreRangeAvg + AmountDelinquent + Status). As I was adding these variables (that I thought may help me explain the variability of the response data), I kept an eye on the R-squared value. With those 5 predictor variables, the model can explain about 59% of the variability of the data, which is quite decent.
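A minimal sketch of building nested linear models and tracking the fit as predictors are added. The data are simulated, the variable names are placeholders, and the score is treated as a plain numeric predictor for simplicity (in the table above, ProsperScore is an ordered factor and so enters through polynomial contrasts).

```r
set.seed(1)
n <- 500

# Simulated stand-ins for ProsperScore and LoanOriginalAmount.
sim <- data.frame(
  score  = sample(1:11, n, replace = TRUE),
  amount = runif(n, 1000, 35000)
)

# Simulated yield: falls with score and amount, plus random noise.
sim$yield <- 0.30 - 0.015 * sim$score - 2e-6 * sim$amount +
  rnorm(n, sd = 0.03)

# Nested models: m2 adds one predictor to m1.
m1 <- lm(yield ~ score, data = sim)
m2 <- lm(yield ~ score + amount, data = sim)

# R-squared never decreases when a predictor is added, so the adjusted
# R-squared (which penalizes extra terms) is the fairer comparison.
r2     <- c(m1 = summary(m1)$r.squared,     m2 = summary(m2)$r.squared)
adj_r2 <- c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
round(rbind(r2, adj_r2), 3)

# F-test for whether the added predictor improves the fit.
anova(m1, m2)
```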

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Based on the median yield vs Prosper Score by Loan Status plot, loans with lower Prosper Scores usually have higher yields. The same holds in the yield vs credit score plot; loans with lower credit scores usually have higher yields. However, when adding a third variable - Loan Status - to both plots, we can see that loans in good standing have lower yields than loans that are delinquent, given the same scores.

Out of curiosity, I generated a linear regression model to explain the variability of the data, with minor success. The variables I chose (ProsperScore, LoanOriginalAmount, CreditScoreRangeAvg, AmountDelinquent, Status) made sense for predicting the Lender Yield of loans. As I added them one by one to the linear model, both the R-squared and the adjusted R-squared increased after each addition (only marginally for AmountDelinquent, from 0.556 to 0.557), which suggests each of these variables plays a part in predicting the yield.

Were there any interesting or surprising interactions between features?

No. The relationships all make sense for the most part.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model for this dataset to predict Lender Yield.

The strengths of my model include:
1. The R-squared value is relatively high at 0.585, which means the model explains about 59% of the variability in this dataset.
2. I kept the number of predictor variables low (5 in total), which should reduce the risk of overfitting.

The limitations of my model include:
1. My model is sensitive to outliers. This is twofold. First, I did not take measures to clean the dataset of outliers before fitting the linear model, so the model may be affected drastically if the outliers are significant. Second, since the model explains the variability of the data within a set range, predicting the yield of a loan outside that range is extrapolation and may not be accurate.
2. I did not explore any other modeling options, such as polynomial regression, which may fit the data and predict the yield better.
3. I did not transform any of the predictor variables (for example, with logs or square roots), which may also improve the fit and the predictions.
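A rough sketch of how these limitations could be probed is shown below. It uses synthetic data with an assumed curved relationship, so it only demonstrates the mechanics: flagging influential points with Cook's distance, and comparing a plain linear fit against one with a polynomial term and a log-transformed predictor. The 4/n cutoff is a common rule of thumb, not a strict threshold.

```r
# Synthetic data; the curved relationship below is an assumption made
# only so the two models have something to disagree about.
set.seed(7)
n <- 400
loans <- data.frame(
  ProsperScore     = sample(1:11, n, replace = TRUE),
  AmountDelinquent = rexp(n, rate = 1 / 800)
)
loans$LenderYield <- 0.32 - 0.03 * loans$ProsperScore +
  0.0012 * loans$ProsperScore^2 +
  0.004 * log1p(loans$AmountDelinquent) +
  rnorm(n, sd = 0.02)

# Baseline: plain linear terms, as in the model discussed above.
base_fit <- lm(LenderYield ~ ProsperScore + AmountDelinquent, data = loans)

# Alternative: quadratic term for the score, log(1 + x) for the skewed amount.
alt_fit <- lm(LenderYield ~ poly(ProsperScore, 2) + log1p(AmountDelinquent),
              data = loans)

# Flag potentially influential observations in the baseline model.
cooks   <- cooks.distance(base_fit)
flagged <- which(cooks > 4 / n)

print(c(base = summary(base_fit)$adj.r.squared,
        alt  = summary(alt_fit)$adj.r.squared))
print(length(flagged))
```

`log1p` is used instead of `log` because a delinquent amount of 0 is valid and `log(0)` is not finite.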


Final Plots and Summary

Plot One

Description One

This multivariate plot summarizes the relationship between Lender Yield and Prosper Score for each Loan Status. Loans with the best statuses - Current or Paid - have lower median Lender Yields than loans with ‘Past Due’ or ‘Defaulted’ statuses. This graph captures the conventional saying of “high risk, high reward”.

Plot Two

Description Two

This stacked bar chart gives us a glimpse of why people go onto prosper.com and ask others for loans. The majority of loans taken out from Prosper are for “Debt Consolidation”. Within this category, almost all of the loans are taken out by people with income. It’s also interesting to see that quite a few people with a $100k salary take out loans to consolidate debts.

Plot Three

Description Three

The boxplot shows how attractive loans with high scores are. Judging by the median for each score, a higher score usually means more investors funding the loan. The number of investors funding an ‘AA’ loan is drastically higher than for loans with any other score, including ‘A’.


Reflection

This was a large dataset with over 80 variables, so I felt going through the entire analysis process was quite an accomplishment. With barely any knowledge of the financial industry, I was surprised how far I could get with common sense, a large dose of curiosity, and R coding skills.

One of the challenges I faced while analyzing this dataset was understanding what each variable stood for. After looking through the variable dictionary a few times, I let my common sense kick in, picked a few variables that should have some obvious relationships, and went from there. Another challenge was choosing which type of visualization would be most useful for displaying the relationships among variables, which meant constantly going through Stack Overflow tips on using different ggplot2 functions.

As I made progress in the analysis, I quickly realized that there was no obvious main feature of interest. Depending on who is in possession of this dataset, one person may see a feature of interest that is completely different from the next person's. Could it be whether a borrower is delinquent? The amount that is delinquent? Predicting whether the status of a loan is good or not? Predicting the lender yield? Or is it figuring out how Prosper rates and scores each loan? By the time I moved on to bivariate analysis, my feature of interest had shifted from “Loan Status” to “LenderYield”, and I had to make adjustments to previous graphs, summaries, and transformations. In short, staying focused on a single feature of interest was very hard. However, once I made that shift, I explored relationships that seemed reasonable and created a linear model that does a decent job of predicting the Lender Yield.

One thing I realized later in the analysis was that I could perhaps transform or combine variables and use other types of models, such as logistic regression, to predict categorical variables such as Loan Status. Other data manipulation techniques, such as checking for and transforming outliers, may also help the predictive power of my current linear model.
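As a rough idea of what the logistic regression follow-up could look like, the sketch below predicts a binary "good standing" flag. It assumes Loan Status has been collapsed into a 0/1 variable (the hypothetical `GoodStanding` column), and it runs on synthetic data with a made-up relationship, so the accuracy it prints says nothing about the real Prosper loans.

```r
# Synthetic data; GoodStanding is a hypothetical 0/1 recoding of Loan Status.
set.seed(3)
n <- 500
loans <- data.frame(
  ProsperScore = sample(1:11, n, replace = TRUE),
  LenderYield  = runif(n, 0.05, 0.35)
)
# Assumed relationship: higher scores and lower yields go with good standing.
p_good <- plogis(-1 + 0.4 * loans$ProsperScore - 3 * loans$LenderYield)
loans$GoodStanding <- rbinom(n, 1, p_good)

# Logistic regression: a binomial GLM with the default logit link.
logit_fit <- glm(GoodStanding ~ ProsperScore + LenderYield,
                 data = loans, family = binomial)

# Predicted probabilities, then a simple 0.5 cutoff for classification.
loans$p_hat <- predict(logit_fit, type = "response")
predicted   <- as.integer(loans$p_hat > 0.5)
accuracy    <- mean(predicted == loans$GoodStanding)
print(accuracy)
```

Accuracy is measured here on the same rows used for fitting, which flatters the model; a held-out test set would give a more honest estimate.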